Class Workbook

Data Sets

German Credit Data

A German financial company wants to build a model that predicts defaults on consumer loans in the German market. When you are called in, the company has already built a model and asks you to evaluate it, because there is a concern that the model unfairly evaluates young customers. Your task is to figure out whether this is true and to devise a way to correct the problem. The data used to make the predictions, as well as the predictions themselves, can be found in the germancredit data.

The data contain the outcome of interest, BAD, indicating whether a customer has defaulted on a loan. A model to predict default has already been fit, and the predicted probabilities of default (probability) and the predicted default status coded as yes/no (predicted) have been concatenated to the original data.

Casual assessment

Let’s look at the prediction made by some model that was fit.

Here is the confusion matrix:

##                 
## model_pred       BAD GOOD
##   PredYesDefault 121   57
##   PredNoDefault  179  643
         Actual 1   Actual 0
Pred 1   TP = 121   FP = 57
Pred 0   FN = 179   TN = 643
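As a quick sanity check, the standard metrics can be computed directly from these four cells (a minimal sketch; the counts are the ones in the matrix above):

```r
# Standard classification metrics from the four confusion-matrix cells.
conf_metrics <- function(TP, FP, FN, TN) {
  c(accuracy  = (TP + TN) / (TP + FP + FN + TN),
    precision = TP / (TP + FP),
    recall    = TP / (TP + FN),  # sensitivity / true positive rate
    fpr       = FP / (FP + TN))  # false positive rate
}
round(conf_metrics(TP = 121, FP = 57, FN = 179, TN = 643), 3)
```

Accuracy is about 0.76 and precision about 0.68, which look acceptable at a casual glance.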

It looks OK.

Here is the ROC curve:

library(pROC)
roc_score <- roc(germancredit$BAD, germancredit$probability)  # also computes the AUC
plot(roc_score, main = "ROC curve -- Logistic Regression")

Again, pretty good.

The decile plot looks decent as well.

lift <- function(depvar, predcol, groups=10) {
  if(is.factor(depvar)) depvar <- as.integer(as.character(depvar))
  if(is.factor(predcol)) predcol <- as.integer(as.character(predcol))
  helper = data.frame(cbind(depvar, predcol))
  helper[,"bucket"] = ntile(-helper[,"predcol"], groups)
  gaintable = helper %>% group_by(bucket)  %>%
    summarise_at(vars(depvar), list(total = ~n(),
    totalresp=~sum(., na.rm = TRUE))) %>%
    mutate(Cumresp = cumsum(totalresp),
    Gain=Cumresp/sum(totalresp)*100,
    Cumlift=Gain/(bucket*(100/groups)))
  return(gaintable)
}
library(dplyr)  # lift() relies on dplyr's ntile() and pipe
default = 1 * (germancredit$BAD == "BAD")
revP = germancredit$probability
dt = lift(default, revP, groups = 10)
barplot(dt$totalresp / dt$total, ylab = "Default rate", xlab = "Decile")
abline(h = mean(default), lty = 2, col = "red")

The decile plot does show a slightly concerning point on the right, but it is not obvious.

So overall, if you just used the traditional evaluation method, you would conclude that there is no problem.

Digging deeper

Now, let’s look at this by age and gender.

germancredit$Age_cat = cut(germancredit$Age, c(0, 25, 35, 45, 75))
germancredit$FemaleAge_cat = interaction(germancredit$Female, germancredit$Age_cat)
ggplot(germancredit) + geom_bar() + aes(x = Age_cat, fill = BAD) + facet_grid(~Female)

You can see that females in the 25–45 age range are more heavily represented in the dataset.

The proportion of females that actually defaulted is lower than that of males.

The mosaic plot shows some discrepancy in the default probability by age group.


The confusion matrix conditioned on age:

## , ,  = (0,25]
## 
##                 
## model_pred       BAD GOOD
##   PredYesDefault  36   15
##   PredNoDefault   44   95
## 
## , ,  = (25,35]
## 
##                 
## model_pred       BAD GOOD
##   PredYesDefault  47   25
##   PredNoDefault   71  255
## 
## , ,  = (35,45]
## 
##                 
## model_pred       BAD GOOD
##   PredYesDefault  15    7
##   PredNoDefault   40  164
## 
## , ,  = (45,75]
## 
##                 
## model_pred       BAD GOOD
##   PredYesDefault  23   10
##   PredNoDefault   24  129
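To quantify the discrepancy, the per-group error rates can be computed directly from the four conditional matrices above (the counts are copied from the output):

```r
# Error rates by age group, from the conditional confusion matrices above.
cm <- list(
  "(0,25]"  = c(TP = 36, FP = 15, FN = 44, TN = 95),
  "(25,35]" = c(TP = 47, FP = 25, FN = 71, TN = 255),
  "(35,45]" = c(TP = 15, FP = 7,  FN = 40, TN = 164),
  "(45,75]" = c(TP = 23, FP = 10, FN = 24, TN = 129))
rates <- sapply(cm, function(m)
  c(FNR = m[["FN"]] / (m[["TP"]] + m[["FN"]]),   # missed actual defaulters
    FPR = m[["FP"]] / (m[["FP"]] + m[["TN"]])))  # good customers flagged as bad
round(rates, 3)
```

The youngest group has the highest false positive rate (about 0.14 versus 0.04–0.09 for the others): non-defaulting young customers are the most likely to be flagged as defaulters, which is exactly the concern raised at the outset.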

COMPAS

In the US, judges, probation officers, and parole officers use algorithms to evaluate the likelihood of a criminal defendant re-offending, a concept commonly referred to as recidivism. Numerous risk assessment algorithms are circulating with two prominent nationwide tools provided by commercial vendors.

One of these tools, Northpointe’s COMPAS (Correctional Offender Management Profiling for Alternative Sanctions), has made national headlines over apparent bias against certain protected groups. Your job is to figure out whether this is the case.

https://github.com/propublica/compas-analysis/


The variable of interest is two_year_recid, indicating whether the individual committed a crime within two years.

## 
##    0    1 
## 3363 2809

There are two scores for recidivism risk. One is the categorical score:

##         
##             0    1
##   Low    2345 1076
##   Medium  721  886
##   High    297  847

The other is the decile score:

##     
##         0    1
##   1  1009  277
##   2   558  264
##   3   403  244
##   4   375  291
##   5   302  280
##   6   221  308
##   7   198  298
##   8   118  302
##   9   120  300
##   10   59  245

If you look at the risk and the outcome by race you can see discrepancies.

How would you evaluate the COMPAS result?

Adult Census Data

This dataset is used to predict whether income exceeds $50K/yr based on census data; it is also known as the “Census Income” dataset. The train dataset contains 13 features and 30178 observations; the test dataset contains 13 features and 15315 observations. The target column is “target”: a binary factor where 1: <=50K and 2: >50K for annual income. The column “sex” is set as a protected attribute.

Here are the EDA results.


Researchers want to know who makes more money, so they fit a logistic regression.

What do the residuals look like?

How about a decile plot?

What is the conclusion? Is there a problem?

Diabetes dataset

The diabetes dataset describes clinical care at 130 US hospitals and integrated delivery networks from 1999 to 2008. The classification task is to predict whether a patient will be readmitted within 30 days.

https://fairlearn.org/main/user_guide/datasets/diabetes_hospital_data.html https://www.hindawi.com/journals/bmri/2014/781670/

We grabbed the preprocessed data so you don’t need to clean it.

The target is readmit_30_days, which is a binary attribute that indicates whether the patient was readmitted within 30 days.

## 
##     0     1 
## 90409 11357

The researchers fit a glm model.

ROC

library(pROC)
roc_score <- roc(diabetic$readmit_30_days, diabetes_glm_model$fitted)  # also computes the AUC
plot(roc_score, main = "ROC curve -- Logistic Regression")

What do the residuals look like?

The confusion matrix is not useful, since with the default cutoff of 0.5 everyone is predicted as 0.

##    
##         0     1
##   0 90409 11357
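One way around this is to lower the cutoff toward the base rate. A self-contained sketch with synthetic data (standing in for the diabetic data, which is not reproduced here):

```r
set.seed(1)
# Synthetic stand-in: ~11% positives, roughly matching the readmission rate
y <- rbinom(1000, 1, 0.11)
p <- plogis(qlogis(0.11) + 1.5 * y + rnorm(1000, sd = 0.5))  # toy risk scores
sum(p > 0.5)      # cutoff 0.5: very few positive predictions
sum(p > mean(y))  # cutoff at the base rate: a usable split
table(predicted = as.integer(p > mean(y)), actual = y)
```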

How about a decile plot?

You see that the model is capturing something. Do you see any problem with this model with respect to protected attributes such as race and gender?

In class activity

Choose one of the data described above as your target problem.

I choose the Diabetes dataset.

Discuss what a favorable label in this problem is, and what a favorable label grants the affected user. Is it assistive or punitive?

The favorable label here is a prediction that a patient will not be readmitted within 30 days (readmit_30_days = 0).
It is non-punitive because its goal is to identify patients at risk of readmission to improve care and manage resources efficiently, not to penalize patients for their health status.

What type of justice is this issue about?

It is about distributive justice: ensuring that all patients are assessed equitably based on their data, without bias or unfair treatment.

Discuss the potential concerns about the data being used.

Bias: Risks due to demographic data (race, gender, age).
Privacy: Sensitive health information requires protection.
Completeness: Gaps in data, e.g., "Missing" in medical_specialty.

Discuss what type of group fairness metric is appropriate for this problem.

Statistical Parity Difference is suitable for this problem. It would measure the difference in the rate of being predicted not to be readmitted (favorable outcome) between the privileged group (e.g., whites) and the unprivileged group (e.g., blacks). A lower statistical parity difference indicates less disparity.

Using the appropriate fairness metrics, show if there are concerns in the prediction algorithm.

# Predict the probability of readmission within 30 days
diabetic$predicted_prob <- predict(diabetes_glm_model, type = "response")

# Privileged and Unprivileged groups
privileged <- diabetic$race == 'Caucasian'
unprivileged <- diabetic$race == 'AfricanAmerican'

# Average predicted probability of readmission in each group
# (used here as a proxy for the selection rate)
selection_rate_privileged <- mean(diabetic$predicted_prob[privileged])
selection_rate_unprivileged <- mean(diabetic$predicted_prob[unprivileged])

# Statistical Parity Difference
statistical_parity_difference <- selection_rate_privileged - selection_rate_unprivileged
statistical_parity_difference
## [1] 0.0007244016
The Statistical Parity Difference is 0.0007244016, which is very small. (Strictly speaking, this compares average predicted probabilities rather than thresholded selection rates.) The result suggests no significant fairness concern between the racial groups considered here, in terms of the rate at which the prediction algorithm favors one group over the other when predicting 30-day readmissions.

Given that you have access to the original data, but not to the model used to make the prediction, discuss which mitigation strategy might be more appropriate to deal with the problem, if any.

We can start by balancing the dataset during pre-processing to ensure no group is under- or over-represented. Then, after the model has made its predictions, we might adjust the decision thresholds as a form of post-processing to balance the outcomes across different groups. It’s also important to establish a routine of continuously monitoring the model’s fairness metrics and making necessary adjustments over time. Finally, document each step we take to maintain fairness for the sake of transparency and accountability.
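As one concrete post-processing sketch, group-specific cutoffs can be chosen so that every group ends up with approximately the same selection rate. The helper equalize_cutoffs below is hypothetical, not from any package, and is shown on toy data:

```r
# Hypothetical helper: pick a per-group cutoff so each group's selection rate
# matches a common target rate (here, the overall rate at cutoff 0.5).
equalize_cutoffs <- function(prob, group, target = mean(prob > 0.5)) {
  tapply(prob, group, function(p) quantile(p, 1 - target))
}
set.seed(42)
prob  <- runif(200)                 # toy predicted probabilities
group <- rep(c("A", "B"), each = 100)
cuts  <- equalize_cutoffs(prob, group)
sel   <- prob > cuts[group]         # apply each group's own cutoff
tapply(sel, group, mean)            # selection rates are now nearly equal
```

The design choice here is the target rate: matching the overall rate preserves the total number of positive decisions while removing the between-group gap.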

Fairness Metrics

Fairness metrics can be classified in several ways. Many fairness metrics for discrete outcomes are derived from the conditional confusion matrix. For each protected group of interest, we can define a conditional confusion matrix as:

For each group \(g\):

         Actual 1    Actual 0
Pred 1   \(TP_g\)    \(FP_g\)
Pred 0   \(FN_g\)    \(TN_g\)

giving one such matrix per group \(g_1, g_2, \dots\).

Depending on the context different metrics are appropriate.

Definitions Based on Predicted Outcomes That Do Not Require Actual Outcomes

Demographic parity (Statistical Parity, Equal Parity, Equal Acceptance Rate or Independence)

Demographic parity is one of the most popular fairness indicators in the literature.

Demographic parity is achieved if the absolute numbers of positive predictions in the subgroups are close to each other. \[(TP_g + FP_g)\] This measure does not take the true class into consideration and depends only on the model predictions. In some literature, demographic parity is also referred to as statistical parity or independence.

##                       (0,25]    (25,35]     (35,45]     (45,75]
## Positively classified     51  72.000000  22.0000000  33.0000000
## Demographic Parity         1   1.411765   0.4313725   0.6470588
## Group size               190 398.000000 226.0000000 186.0000000
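The parity row is simply each group's count divided by the reference group's count; reproducing it from the positively classified counts above:

```r
# Demographic parity relative to the (0,25] reference group.
pos <- c("(0,25]" = 51, "(25,35]" = 72, "(35,45]" = 22, "(45,75]" = 33)
round(pos / pos[["(0,25]"]], 6)
```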

Of course, comparing absolute numbers of positive predictions will show a large disparity whenever the number of cases in each group differs, which artificially inflates the disparity. This is true in our case:

## 
## Female   Male 
##    690    310

Proportional parity (Impact Parity or Minimizing Disparate Impact) [Calders and Verwer 2010]

Proportional parity is calculated by comparing the proportion of positively classified individuals in each subgroup of the data. \[(TP_g + FP_g) / (TP_g + FP_g + TN_g + FN_g)\] Proportional parity is very similar to demographic parity, but modifies it to address the issue that unequal group sizes artificially inflate the disparity. In some literature, proportional parity and demographic parity are considered equivalent, which is true when the protected group sizes are equal. Proportional parity is achieved if the proportions of positive predictions in the subgroups are close to each other. Like demographic parity, this measure does not depend on the true labels.

In the returned named vector, the reference group will be assigned 1, while all other groups will be assigned values according to whether their proportion of positively predicted observations are lower or higher compared to the reference group. Lower proportions will be reflected in numbers lower than 1 in the returned named vector.

##                          (0,25]     (25,35]      (35,45]     (45,75]
## Proportion            0.2684211   0.1809045   0.09734513   0.1774194
## Proportional Parity   1.0000000   0.6739580   0.36265834   0.6609741
## Group size          190.0000000 398.0000000 226.00000000 186.0000000

Definitions Based on Predicted and Actual Outcomes

Predictive rate parity

Predictive rate parity is achieved if the precisions (or positive predictive values) in the subgroups are close to each other. The precision stands for the number of the true positives divided by the total number of examples predicted positive within a group. \[TP_g / (TP_g + FP_g)\]

##                             (0,25]     (25,35]     (35,45]     (45,75]
## Precision                0.2941176   0.3472222   0.3181818   0.3030303
## Predictive Rate Parity   1.0000000   1.1805556   1.0818182   1.0303030
## Group size             190.0000000 398.0000000 226.0000000 186.0000000

The first row shows the raw precision values for the age groups. The second row displays the relative precisions compared to the 0–25 age group.

In a perfect world, all predictive rate parities would equal one, meaning that precision in every group is the same as in the base group. In practice, values will differ. A parity above one indicates that precision in that group is relatively higher, whereas a lower parity implies lower precision. A large variance in parities hints that the model is not performing equally well across age groups.

The result suggests that the model is worse for younger people. This implies that there are more cases where the model mistakenly predicts that a person will default if they are young.

If the middle-aged group is set as the base group, the raw precision values do not change; only the relative metrics change.

##                            (25,35]      (0,25]     (35,45]     (45,75]
## Precision                0.3472222   0.2941176   0.3181818   0.3030303
## Predictive Rate Parity   1.0000000   0.8470588   0.9163636   0.8727273
## Group size             398.0000000 190.0000000 226.0000000 186.0000000
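Both tables are rescalings of the same precision vector: switching the base group just divides by a different entry. Reproducing the parity rows from the printed precisions:

```r
# Predictive rate parity is a rescaling of the per-group precisions.
prec <- c("(0,25]" = 0.2941176, "(25,35]" = 0.3472222,
          "(35,45]" = 0.3181818, "(45,75]" = 0.3030303)
round(prec / prec[["(0,25]"]], 6)   # parities relative to (0,25]
round(prec / prec[["(25,35]"]], 6)  # parities relative to (25,35]
```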

False negative rate parity [Chouldechova 2017]

False negative rates are calculated by the division of false negatives with all positives (irrespective of predicted values). \[FN_g / (TP_g + FN_g)\] False negative rate parity is achieved if the false negative rates (the ratio between the number of false negatives and the total number of positives) in the subgroups are close to each other.

In the returned named vector, the reference group will be assigned 1, while all other groups will be assigned values according to whether their false negative rates are lower or higher compared to the reference group. Lower false negative error rates will be reflected in numbers lower than 1 in the returned named vector, thus numbers lower than 1 mean BETTER prediction for the subgroup.

##                 (0,25]     (25,35]     (35,45]     (45,75]
## FNR          0.8636364   0.9107143   0.9590643   0.9280576
## FNR Parity   1.0000000   1.0545113   1.1104955   1.0745930
## Group size 190.0000000 398.0000000 226.0000000 186.0000000

False positive rate parity [Chouldechova 2017]

False positive rates are calculated by the division of false positives with all negatives (irrespective of predicted values). \[FP_g / (TN_g + FP_g)\] False positive rate parity is achieved if the false positive rates (the ratio between the number of false positives and the total number of negatives) in the subgroups are close to each other.

In the returned named vector, the reference group will be assigned 1, while all other groups will be assigned values according to whether their false positive rates are lower or higher compared to the reference group. Lower false positives error rates will be reflected in numbers lower than 1 in the returned named vector, thus numbers lower than 1 mean BETTER prediction for the subgroup.

##            (0,25]     (25,35]     (35,45]     (45,75]
## FPR          0.45   0.3983051   0.2727273   0.4893617
## FPR Parity   1.00   0.8851224   0.6060606   1.0874704
## Group size 190.00 398.0000000 226.0000000 186.0000000

Equalized odds (Equal Opportunity, Positive Rate Parity or Separation)

Equalized odds is calculated by dividing the true positives by all actual positives (irrespective of predicted values). \[TP_g / (TP_g + FN_g)\] This metric equals what is traditionally known as sensitivity.

In the returned named vector, the reference group will be assigned 1, while all other groups will be assigned values according to whether their sensitivities are lower or higher compared to the reference group. Lower sensitivities will be reflected in numbers lower than 1 in the returned named vector, thus numbers lower than 1 mean WORSE prediction for the subgroup. Equalized odds are achieved if the sensitivities in the subgroups are close to each other.

##                     (0,25]      (25,35]      (35,45]      (45,75]
## Sensitivity      0.1363636   0.08928571   0.04093567   0.07194245
## Equalized odds   1.0000000   0.65476190   0.30019493   0.52757794
## Group size     190.0000000 398.00000000 226.00000000 186.00000000

Accuracy parity [Friedler et al., 2018]

Accuracy metrics are calculated by the division of correctly predicted observations (the sum of all true positives and true negatives) with the number of all predictions. \[(TP_g + TN_g) / (TP_g + FP_g + TN_g + FN_g)\] Accuracy parity is achieved if the accuracies (all accurately classified examples divided by the total number of examples) in the subgroups are close to each other.

In the returned named vector, the reference group will be assigned 1, while all other groups will be assigned values according to whether their accuracies are lower or higher compared to the reference group. Lower accuracies will be reflected in numbers lower than 1 in the returned named vector, thus numbers lower than 1 mean WORSE prediction for the subgroup.

##                      (0,25]     (25,35]     (35,45]     (45,75]
## Accuracy          0.3105263   0.2412060   0.2079646   0.1827957
## Accuracy Parity   1.0000000   0.7767652   0.6697165   0.5886641
## Group size      190.0000000 398.0000000 226.0000000 186.0000000

Negative predictive value parity

Negative predictive value parity can be considered the ‘inverse’ of predictive rate parity. Negative predictive values are calculated by dividing the true negatives by all predicted negatives. \[TN_g / (TN_g + FN_g)\] Negative predictive value parity is achieved if the negative predictive values in the subgroups are close to each other.

In the returned named vector, the reference group will be assigned 1, while all other groups will be assigned values according to whether their negative predictive values are lower or higher compared to the reference group. Lower negative predictive values will be reflected in numbers lower than 1 in the returned named vector, thus numbers lower than 1 mean WORSE prediction for the subgroup.

##                 (0,25]     (25,35]     (35,45]     (45,75]
## NPV          0.3165468   0.2177914   0.1960784   0.1568627
## NPV Parity   1.0000000   0.6880229   0.6194296   0.4955437
## Group size 190.0000000 398.0000000 226.0000000 186.0000000

Matthews correlation coefficient parity

In the returned named vector, the reference group will be assigned 1, while all other groups will be assigned values according to whether their Matthews Correlation Coefficients are lower or higher compared to the reference group. Lower Matthews Correlation Coefficients rates will be reflected in numbers lower than 1 in the returned named vector, thus numbers lower than 1 mean WORSE prediction for the subgroup.

The Matthews correlation coefficient (MCC) considers all four classes of the confusion matrix. MCC is sometimes referred to as the single most powerful metric in binary classification problems, especially for data with class imbalances.

\[(TP_g×TN_g-FP_g×FN_g)/\sqrt{((TP_g+FP_g)×(TP_g+FN_g)×(TN_g+FP_g)×(TN_g+FN_g))}\]

##                 (0,25]     (25,35]     (35,45]     (45,75]
## MCC         -0.3494421  -0.3666323  -0.3355449  -0.4748169
## MCC Parity   1.0000000   1.0491931   0.9602303   1.3587854
## Group size 190.0000000 398.0000000 226.0000000 186.0000000
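A direct implementation of the MCC formula above, checked against the overall confusion matrix from the start of this workbook:

```r
# Matthews correlation coefficient from the four confusion-matrix cells.
mcc <- function(TP, FP, FN, TN) {
  (TP * TN - FP * FN) /
    sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))
}
mcc(TP = 121, FP = 57, FN = 179, TN = 643)  # overall germancredit model
```

For the overall model this gives about 0.39; an MCC of 1 would be perfect prediction and 0 no better than chance.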

Specificity parity

Specificity parity can be considered the ‘inverse’ of the equalized odds. Specificity is calculated by the division of true negatives with all negatives (irrespective of predicted values). \[TN_g / (TN_g + FP_g)\]
Specificity parity is achieved if the specificity (the ratio of the number of the true negatives and the total number of negatives) in the subgroups are close to each other.

In the returned named vector, the reference group will be assigned 1, while all other groups will be assigned values according to whether their specificity is lower or higher compared to the reference group. Lower specificity will be reflected in numbers lower than 1 in the returned named vector, thus numbers lower than 1 mean WORSE prediction for the subgroup.

##                    (0,25]     (25,35]     (35,45]     (45,75]
## Specificity          0.55   0.6016949   0.7272727   0.5106383
## Specificity Parity   1.00   1.0939908   1.3223140   0.9284333
## Group size         190.00 398.0000000 226.0000000 186.0000000

ROC AUC parity

The equality of the area under the ROC for different groups identified by protected attributes can be seen as analogous to the equality of accuracy.

This function computes the ROC AUC values for each subgroup. In the returned table, the reference group will be assigned 1, while all other groups will be assigned values according to whether their ROC AUC values are lower or higher compared to the reference group. Lower ROC AUC will be reflected in numbers lower than 1 in the returned named vector, thus numbers lower than 1 mean WORSE prediction for the subgroup.

This function calculates ROC AUC and visualizes ROC curves for all subgroups. Note that probabilities must be defined for this function. Also, as ROC evaluates all possible cutoffs, the cutoff argument is excluded from this function.

##                     (0,25]     (25,35]     (35,45]     (45,75]
## ROC AUC          0.7389773   0.7820218   0.8137161   0.8152457
## ROC AUC Parity   1.0000000   1.0582488   1.1011382   1.1032080
## Group size     190.0000000 398.0000000 226.0000000 186.0000000

Apart from the standard outputs, the function also returns ROC curves for each of the subgroups.
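The same per-subgroup AUCs can also be computed without a fairness package, using the Wilcoxon identity AUC = P(score of a random positive > score of a random negative). A self-contained sketch on synthetic data (standing in for germancredit):

```r
# ROC AUC per subgroup via the Wilcoxon identity, computed without any package.
auc_wilcox <- function(y, s) {
  pos <- s[y == 1]; neg <- s[y == 0]
  mean(outer(pos, neg, `>`))   # ties ignored: the toy scores are continuous
}
set.seed(7)
g <- rep(c("young", "old"), each = 250)   # toy protected groups
y <- rbinom(500, 1, 0.3)                  # toy outcomes
s <- y * 0.5 + runif(500)                 # toy scores correlated with outcome
sapply(unique(g), function(lv) auc_wilcox(y[g == lv], s[g == lv]))
```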

Software

A handful of software packages have been made available over the last few years. They typically combine fairness-metric calculations with visualizations.

Because they automate the process, they are useful if you can get them to work. Here is an example of using fairmodels.

Step 1: create model(s)

We will look at the germancredit data again, but here we will build our own models. For comparison, let's fit a logistic regression and a random forest model.
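A sketch of step 1; a toy data frame stands in for germancredit here (with the real data, the formula would be BAD ~ .):

```r
set.seed(1)
# Toy stand-in for germancredit: a binary outcome and two predictors
toy <- data.frame(y  = factor(rbinom(200, 1, 0.3), labels = c("GOOD", "BAD")),
                  x1 = rnorm(200), x2 = rnorm(200))
glm_model <- glm(y ~ x1 + x2, data = toy, family = binomial)
# Random forest counterpart (ranger assumed installed):
# library(ranger); rf_model <- ranger(y ~ x1 + x2, data = toy, probability = TRUE)
```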

Step 2: create explainer(s)

You need to create an explainer object.
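The console output that follows was produced by explainers built on the actual germancredit models; the call pattern itself, using DALEX (assumed installed) on self-contained toy data, looks like this:

```r
library(DALEX)
set.seed(1)
toy   <- data.frame(y = rbinom(200, 1, 0.3), x1 = rnorm(200), x2 = rnorm(200))
model <- glm(y ~ x1 + x2, data = toy, family = binomial)
# An explainer bundles the model, the features, and the true labels:
expl <- explain(model, data = toy[, c("x1", "x2")], y = toy$y,
                label = "lm", verbose = FALSE)
```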

## Preparation of a new explainer is initiated
##   -> model label       :  lm  (  default  )
##   -> data              :  1000  rows  21  cols 
##   -> target variable   :  1000  values 
##   -> predict function  :  yhat.glm  will be used (  default  )
##   -> predicted values  :  No value for predict function target column. (  default  )
##   -> model_info        :  package stats , ver. 4.3.1 , task classification (  default  ) 
##   -> predicted values  :  numerical, min =  0.05264789 , mean =  0.7 , max =  0.9983644  
##   -> residual function :  difference between y and yhat (  default  )
##   -> residuals         :  numerical, min =  -0.9774084 , mean =  6.610203e-13 , max =  0.9192795  
##   A new explainer has been created!
## Preparation of a new explainer is initiated
##   -> model label       :  ranger  (  default  )
##   -> data              :  1000  rows  21  cols 
##   -> target variable   :  1000  values 
##   -> predict function  :  yhat.ranger  will be used (  default  )
##   -> predicted values  :  No value for predict function target column. (  default  )
##   -> model_info        :  package ranger , ver. 0.16.0 , task classification (  default  ) 
##   -> predicted values  :  numerical, min =  0.1268095 , mean =  0.6975279 , max =  0.9932718  
##   -> residual function :  difference between y and yhat (  default  )
##   -> residuals         :  numerical, min =  -0.7460224 , mean =  0.002472066 , max =  0.5184032  
##   A new explainer has been created!

Step 3: fairness check

You can run a fairness check on a single model, which shows:
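A sketch of the whole pipeline using the german data bundled with fairmodels (both packages assumed installed; the workbook's own run instead uses germancredit with Age_cat as the protected variable):

```r
library(DALEX)
library(fairmodels)
data("german")  # ships with fairmodels; Risk is the good/bad outcome
y <- as.numeric(german$Risk == "good")
model <- glm(Risk ~ ., data = german, family = binomial)
explainer <- explain(model, data = subset(german, select = -Risk),
                     y = y, label = "lm", verbose = FALSE)
fobject <- fairness_check(explainer,
                          protected  = german$Sex,
                          privileged = "male")
print(fobject)  # which metrics pass, plus the total loss
```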

## Creating fairness classification object
## -> Privileged subgroup       : character ( Ok  )
## -> Protected variable        : factor ( Ok  ) 
## -> Cutoff values for explainers  : 0.5 ( for all subgroups ) 
## -> Fairness objects      : 0 objects 
## -> Checking explainers       : 1 in total (  compatible  )
## -> Metric calculation        : 13/13 metrics calculated for all models
##  Fairness object created succesfully 
## 
## Fairness check for models: lm 
## 
## lm passes 3/5 metrics
## Total loss :  1.557322

Or you can compare the metrics for different models.

## Creating fairness classification object
## -> Privileged subgroup       : character ( Ok  )
## -> Protected variable        : factor ( Ok  ) 
## -> Cutoff values for explainers  : 0.5 ( for all subgroups ) 
## -> Fairness objects      : 0 objects 
## -> Checking explainers       : 2 in total (  compatible  )
## -> Metric calculation        : 10/13 metrics calculated for all models ( 3 NA created )
##  Fairness object created succesfully 
## 
## Fairness check for models: lm, ranger 
## 
## lm passes 3/5 metrics
## Total loss :  1.557322 
## 
## ranger passes 4/5 metrics
## Total loss :  1.460187

You can run the check for other protected variables as well.

## Creating fairness classification object
## -> Privileged subgroup       : character ( Ok  )
## -> Protected variable        : factor ( Ok  ) 
## -> Cutoff values for explainers  : 0.5 ( for all subgroups ) 
## -> Fairness objects      : 0 objects 
## -> Checking explainers       : 2 in total (  compatible  )
## -> Metric calculation        : 10/13 metrics calculated for all models ( 3 NA created )
##  Fairness object created succesfully 

Reference

https://cran.r-project.org/web/packages/fairness/vignettes/fairness.html
https://ashryaagr.github.io/Fairness.jl/dev/datasets/

Calders, T., Verwer, S. Three naive Bayes approaches for discrimination-free classification. Data Min Knowl Disc 21, 277–292 (2010). https://doi.org/10.1007/s10618-010-0190-x